Principal Component Analysis (PCA) is widely used in data exploration, dimension reduction, and data visualization. The aim is to transform the original variables into uncorrelated linear combinations while retaining as much of the information in the data as possible. High-dimensional data often reveal clusters when viewed in a lower-dimensional projection.
Clustering analysis is another form of EDA. Here we hope to group data points so that points within a group are close to each other and points in different groups are far apart. Clustering on PCs can be effective. Clustering analysis can be quite subjective in how we summarize the properties within each group.
Both PCA and clustering analysis are so-called unsupervised learning: no response variable is involved in the process.
For supervised learning, we try to find out how a set of predictors relates to a response variable of interest. Multiple regression is still, by far, one of the most popular methods. We use linear models as working models for their simplicity and interpretability. It is important to use domain knowledge as much as we can to determine the form of the response as well as the functional form of the predictors.
Self-esteem generally describes a person’s overall sense of self-worth and personal value. It can play a significant role in one’s motivation and success throughout life. Factors that influence self-esteem include inner thinking, health, age, life experiences, etc. We will try to identify possible factors in our data that are related to the level of self-esteem.
The well-cited National Longitudinal Survey of Youth (NLSY79) follows about 13,000 individuals, and extensive individual-year information has been gathered through surveys. The survey data are publicly available. Among the many variables, we assembled a subset including personal demographic variables in different years, household environment in 1979, ASVAB test scores in 1981, and self-esteem scores in 1981 and 1987 respectively.
The data are stored in NLSY79.csv.
Here are the descriptions of the variables:
Personal Demographic Variables
Household Environment
Variables Related to ASVAB test Scores in 1981
| Test | Description |
|---|---|
| AFQT | percentile score on the AFQT intelligence test in 1981 |
| Coding | score on the Coding Speed test in 1981 |
| Auto | score on the Automotive and Shop test in 1981 |
| Mechanic | score on the Mechanic test in 1981 |
| Elec | score on the Electronics Information test in 1981 |
| Science | score on the General Science test in 1981 |
| Math | score on the Math test in 1981 |
| Arith | score on the Arithmetic Reasoning test in 1981 |
| Word | score on the Word Knowledge Test in 1981 |
| Parag | score on the Paragraph Comprehension test in 1981 |
| Numer | score on the Numerical Operations test in 1981 |
Self-Esteem Tests in 1981 and 1987 (two tests were administered)
We have two sets of self-esteem tests, one in 1981 and the other in 1987. Each set has the same 10 questions, labeled Esteem81 and Esteem87 respectively, followed by the question number. For example, Esteem81_1 is Esteem question 1 in 1981.
The following 10 questions are answered as 1: strongly agree, 2: agree, 3: disagree, 4: strongly disagree (Likert-type items)
Load the data. Do a quick EDA to get familiar with the data set. Pay attention to the unit of each variable. Are there any missing values?
Student Answer:
NLSY79<-read.csv("./data/NLSY79.csv")
str(NLSY79) # data format
## 'data.frame': 2431 obs. of 46 variables:
## $ Subject : int 2 6 7 8 9 13 16 17 18 20 ...
## $ Gender : chr "female" "male" "male" "female" ...
## $ Education05 : int 12 16 12 14 14 16 13 13 13 17 ...
## $ Income87 : int 16000 18000 0 9000 15000 2200 27000 20000 28000 27000 ...
## $ Job05 : chr "4700 TO 4960: Sales and Related Workers" "10 TO 430: Executive, Administrative and Managerial Occupations" "7900 TO 8960: Setters, Operators and Tenders" "5000 TO 5930: Office and Administrative Support Workers" ...
## $ Income05 : int 5500 65000 19000 36000 65000 8000 71000 43000 120000 64000 ...
## $ Weight05 : int 160 187 175 246 180 235 160 188 173 130 ...
## $ HeightFeet05 : int 5 5 5 5 5 6 5 5 5 5 ...
## $ HeightInch05 : int 2 5 9 3 6 0 4 10 9 4 ...
## $ Imagazine : int 1 0 1 1 1 1 1 1 1 1 ...
## $ Inewspaper : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Ilibrary : int 1 1 1 1 1 1 1 1 1 1 ...
## $ MotherEd : int 5 12 12 9 12 12 12 12 12 12 ...
## $ FatherEd : int 8 12 12 6 10 16 12 15 16 18 ...
## $ FamilyIncome78: int 20000 35000 8502 7227 17000 20000 48000 15000 4510 50000 ...
## $ Science : int 6 23 14 18 17 16 13 19 22 21 ...
## $ Arith : int 8 30 14 13 21 30 17 29 30 17 ...
## $ Word : int 15 35 27 35 28 29 30 33 35 28 ...
## $ Parag : int 6 15 8 12 10 13 12 13 14 14 ...
## $ Number : int 29 45 32 24 40 36 49 35 48 39 ...
## $ Coding : int 52 68 35 48 46 30 58 58 61 54 ...
## $ Auto : int 9 21 13 11 13 21 11 18 21 18 ...
## $ Math : int 6 23 11 4 13 24 17 21 23 20 ...
## $ Mechanic : int 10 21 9 12 13 19 11 19 16 20 ...
## $ Elec : int 5 19 11 12 15 16 10 16 17 13 ...
## $ AFQT : num 6.84 99.39 47.41 44.02 59.68 ...
## $ Esteem81_1 : int 1 2 2 1 1 1 2 2 2 1 ...
## $ Esteem81_2 : int 1 1 1 1 1 1 2 2 2 1 ...
## $ Esteem81_3 : int 4 4 3 3 4 4 3 3 3 3 ...
## $ Esteem81_4 : int 1 2 2 2 1 1 2 2 2 1 ...
## $ Esteem81_5 : int 3 4 3 3 1 4 3 3 3 3 ...
## $ Esteem81_6 : int 3 2 2 2 1 1 2 2 2 2 ...
## $ Esteem81_7 : int 1 2 2 3 1 1 3 2 2 1 ...
## $ Esteem81_8 : int 3 4 2 3 4 4 3 3 3 3 ...
## $ Esteem81_9 : int 3 3 3 3 4 4 3 3 3 3 ...
## $ Esteem81_10 : int 3 4 3 3 4 4 3 3 3 3 ...
## $ Esteem87_1 : int 2 1 2 1 1 1 1 2 1 1 ...
## $ Esteem87_2 : int 1 1 2 1 1 1 1 2 1 1 ...
## $ Esteem87_3 : int 4 4 4 3 4 4 4 3 4 4 ...
## $ Esteem87_4 : int 1 1 2 1 1 1 2 2 1 4 ...
## $ Esteem87_5 : int 2 4 4 4 4 4 4 3 4 4 ...
## $ Esteem87_6 : int 2 1 2 2 1 1 2 2 1 1 ...
## $ Esteem87_7 : int 2 2 2 1 1 2 2 2 2 1 ...
## $ Esteem87_8 : int 3 3 4 2 4 4 4 3 4 3 ...
## $ Esteem87_9 : int 3 2 3 2 4 4 3 3 3 4 ...
## $ Esteem87_10 : int 4 4 4 2 4 4 4 3 4 4 ...
attach(NLSY79)
## The following object is masked from package:ISLR:
##
##     Auto
## see if there are missing values
sum(is.na(NLSY79))
## [1] 0
## Comment: sum(is.na()) reports no NA values. However, some entries are effectively
## missing: a number of Job05 values are blank strings. A blank entry does not
## necessarily indicate a missing value; respondents may simply have felt that no
## category fit their job. In the NA sense, there are no missing values in the data set.
Let us concentrate on the Esteem scores evaluated in 1987.
Reverse the scores so that a higher score corresponds to higher self-esteem. (Hint: store the 1987 scores in data.esteem, then data.esteem[, c(1, 2, 4, 6, 7)] <- 5 - data.esteem[, c(1, 2, 4, 6, 7)] reverses the scores.)
Student Answer: we reverse the scores of Esteem questions 1, 2, 4, 6, and 7. Here we do not change the original data set; we only modify the stored copy of the esteem data.
esteem_data<-NLSY79[,c(37:46)]
esteem_data[,c(1, 2, 4, 6, 7)] <- 5 - esteem_data[,c(1, 2, 4, 6, 7)]
colnames(esteem_data) <- paste0("Esteem87_", 1:10)
Student Answer: the distributions of the 10 esteem measurements:
Since we want to display the distribution of each of the 10 esteem measurements, a single two-dimensional plot will not suffice (one x-axis cannot show the score distribution of all 10 measurements at once).
So we resort to a grid of histograms, one per measurement:
library(ggplot2)
# build one histogram per esteem item, then arrange all ten in a grid
plots <- lapply(colnames(esteem_data), function(v) {
  ggplot(esteem_data) +
    geom_histogram(aes(x = .data[[v]]), bins = 10, fill = "blue") +
    labs(title = v, x = "Score", y = "Frequency")
})
gridExtra::grid.arrange(grobs = plots, ncol = 2)  # arrange the ten histograms in two columns
Student Answer:
From the pairwise correlation matrix below, we observe that all entries are positive, with correlations ranging from 0.24 to 0.70. This is natural, since all ten Esteem87 items measure the same underlying construct: higher scores on each item indicate higher self-esteem.
corr_pairwise<-cor(esteem_data,method="pearson")
round(corr_pairwise,2)
##             Esteem87_1 Esteem87_2 Esteem87_3 Esteem87_4 Esteem87_5 Esteem87_6
## Esteem87_1 1.00 0.70 0.45 0.53 0.40 0.46
## Esteem87_2 0.70 1.00 0.44 0.55 0.40 0.48
## Esteem87_3 0.45 0.44 1.00 0.41 0.55 0.41
## Esteem87_4 0.53 0.55 0.41 1.00 0.38 0.51
## Esteem87_5 0.40 0.40 0.55 0.38 1.00 0.40
## Esteem87_6 0.46 0.48 0.41 0.51 0.40 1.00
## Esteem87_7 0.38 0.41 0.34 0.42 0.37 0.60
## Esteem87_8 0.27 0.28 0.35 0.30 0.38 0.41
## Esteem87_9 0.24 0.26 0.35 0.29 0.35 0.36
## Esteem87_10 0.31 0.33 0.46 0.37 0.44 0.44
## Esteem87_7 Esteem87_8 Esteem87_9 Esteem87_10
## Esteem87_1 0.38 0.27 0.24 0.31
## Esteem87_2 0.41 0.28 0.26 0.33
## Esteem87_3 0.34 0.35 0.35 0.46
## Esteem87_4 0.42 0.30 0.29 0.37
## Esteem87_5 0.37 0.38 0.35 0.44
## Esteem87_6 0.60 0.41 0.36 0.44
## Esteem87_7 1.00 0.39 0.35 0.39
## Esteem87_8 0.39 1.00 0.43 0.44
## Esteem87_9 0.35 0.43 1.00 0.58
## Esteem87_10 0.39 0.44 0.58 1.00
PCA on the 10 esteem measurements (centered but not scaled).
Student Answer: we perform PCA on the 10 esteem measurement variables.
pc_esteem.10<-prcomp(esteem_data,scale=FALSE) ## by default, center=TRUE but scale=FALSE
names(pc_esteem.10)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
## pc_esteem.10$x holds the PC scores for PC1, PC2, ..., PC10
## ten variables here, so we have 10 PC loading vectors
pc_esteem.10$rotation[,c(1,2)]
## PC1 PC2
## Esteem87_1 0.235 -0.374
## Esteem87_2 0.244 -0.367
## Esteem87_3 0.279 -0.149
## Esteem87_4 0.261 -0.321
## Esteem87_5 0.312 -0.131
## Esteem87_6 0.313 -0.209
## Esteem87_7 0.299 -0.163
## Esteem87_8 0.393 0.332
## Esteem87_9 0.398 0.578
## Esteem87_10 0.376 0.260
The PC1 and PC2 loadings are displayed above; each is a vector with ten entries that describes the direction of the corresponding principal component.
b) Are there good interpretations for PC1 and PC2? (If loadings are all negative, take the positive loadings for the ease of interpretation)
Student Answer: Comment: if all loadings were negative, flipping their signs would give an equivalent direction, so we may take the positive loadings for ease of interpretation.
PC1: the direction (a line through the origin of the centered data) that minimizes the total squared perpendicular distance between the data points and the line; equivalently, the linear combination of all 10 esteem measurements with maximal variance.
It is approximately 0.235 Esteem87_1 + 0.244 Esteem87_2 + 0.279 Esteem87_3 + 0.261 Esteem87_4 + 0.312 Esteem87_5 + 0.313 Esteem87_6 + 0.299 Esteem87_7 + 0.393 Esteem87_8 + 0.398 Esteem87_9 + 0.376 Esteem87_10.
Since all loadings are positive and of similar magnitude, PC1 is roughly proportional to the total of the ten (centered) test scores: a higher PC1 score suggests a higher overall level of self-esteem.
PC2: approximately proportional to the difference between the last three esteem scores (8, 9, 10) and the first seven. For comparable total scores, a higher PC2 implies a stronger esteem response on the last three items, while a lower PC2 implies a stronger response on the first seven items.
c) How is the PC1 score obtained for each subject? Write down the formula.
Student answer:
pc1_esteem<-pc_esteem.10$x[,c(1)] # PC1 score for each subject
Formula (applied to the centered data, since prcomp() centers by default):
PC1_Score = 0.235 (Esteem87_1 − mean_1) + 0.244 (Esteem87_2 − mean_2) + 0.279 (Esteem87_3 − mean_3) + 0.261 (Esteem87_4 − mean_4) + 0.312 (Esteem87_5 − mean_5) + 0.313 (Esteem87_6 − mean_6) + 0.299 (Esteem87_7 − mean_7) + 0.393 (Esteem87_8 − mean_8) + 0.398 (Esteem87_9 − mean_9) + 0.376 (Esteem87_10 − mean_10),
where mean_j is the sample mean of the j-th item.
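As a self-contained sanity check (on synthetic data, not the NLSY79 file), the scores returned by prcomp are exactly the loadings applied to the centered data:

```r
set.seed(1)
X <- matrix(rnorm(50), nrow = 10)            # any small numeric data matrix
pc <- prcomp(X, scale = FALSE)               # centers by default, no scaling
centered <- scale(X, center = TRUE, scale = FALSE)
manual_pc1 <- centered %*% pc$rotation[, 1]  # apply the PC1 loadings by hand
all.equal(as.numeric(manual_pc1), as.numeric(pc$x[, 1]))  # TRUE
```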
d) Are PC1 scores and PC2 scores in the data uncorrelated?
Student Answer:
PC1 and PC2 scores are uncorrelated by construction, as verified below: the sample correlation is zero up to numerical precision.
cor(pc_esteem.10$x[,c(1)],pc_esteem.10$x[,c(2)])
## [1] 1.12e-14
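This zero correlation is not a coincidence of our data: the PC score columns from any prcomp fit are mutually uncorrelated, as a quick synthetic example confirms:

```r
set.seed(2)
pc <- prcomp(matrix(rnorm(300), ncol = 3))
# correlation between any two PC score columns is zero up to rounding error
abs(cor(pc$x[, 1], pc$x[, 2])) < 1e-10  # TRUE
```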
e) Plot PVE (Proportion of Variance Explained) and summarize the plot.
Student Answer:
plot(summary(pc_esteem.10)$importance[2, ], ylab="PVE",
     xlab="Number of PC's",
     main="Scree Plot of PVE for NLSY79")
Summary of analysis: the PVE drops sharply from PC1 to PC2, and from PC3 onward each additional PC explains less than 10% of the variance. The two leading PC's should therefore be sufficient.
f) Also plot CPVE (Cumulative Proportion of Variance Explained). What proportion of the variance in the data is explained by the first two principal components?
Student Answer: This is another criterion we may use to determine the number of PC’s to use.
plot(summary(pc_esteem.10)$importance[3, ], pch=16, ylab="Cumulative PVE",
     xlab="Number of PC's", main="Scree Plot of Cumulative PVE for NLSY79")
From the plot of cumulative PVE, we see that about 60% of the variance in the data is explained by the first two principal components. If we want to be conservative (say, make sure 90% of the cumulative PVE is covered), then six PC's should suffice.
g) PC’s provide us with a low dimensional view of the self-esteem scores. Use a biplot with the first two PC's to display the data. Give an interpretation of PC1 and PC2 from the plot. (try `ggbiplot` if you could, much prettier!)
Student Answer: Visualize the PC scores together with loadings of original variables. We may take a look at the correlation structure from it:
lim <- c(-.10, .10)
biplot(pc_esteem.10,
xlim=lim,
ylim=lim,
main="Biplot of the PC's")
abline(v=0, h=0)
In this plot, the x-axis is the PC1 score and the y-axis is the PC2 score; the top and right axes give the scale of the variable loadings on PC1 and PC2 respectively.
Interpretation of PC1 and PC2:
For comparable total scores, a higher PC2 implies a stronger esteem response on the last three items, while a lower PC2 implies a stronger response on the first seven items.
We observe that the last three questions tend to read as the respondents’ wish to be more accomplished; thus a higher PC2 seems to reflect a stronger felt need for greater self-esteem.
Apply k-means to cluster subjects on the original esteem scores
Student answer:
Elbow rule: plot the explained variation (here, the total within-cluster sum of squares) as a function of the number of clusters, and pick the elbow of the curve as the number of clusters to use.
## minimize total intra-cluster variation
## k-means clustering
set.seed(0)
# function to compute the total within-cluster sum of squares for a given k
# (kmeans() is from base R's stats package; nstart = 10 restarts guard against bad local optima)
wss_esteem <- function(data, k) {
  kmeans(data, centers = k, nstart = 10)$tot.withinss
}
# compute the within-cluster sum of squares for k = 2 to 15 clusters
k.values_esteem <- 2:15
wss_values_esteem <- sapply(k.values_esteem, function(k) wss_esteem(esteem_data, k))
## then we plot the total within-clusters sum of squares against number of clusters
##
plot(k.values_esteem, wss_values_esteem,
     type="b", pch = 19, frame = FALSE,
     xlab="Number of clusters K",
     ylab="Total within-cluster sum of squares (esteem)")
Furthermore, as demonstrated in class, it may be even easier to choose the number of clusters using the factoextra package.
library('factoextra')
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust(esteem_data, kmeans, method = "wss")
Notice how increasing k from 1 to 2 drastically lowers the total within-cluster sum of squares, but further increases in k bring diminishing improvements. The ideal k is found at the elbow of such a plot; here k = 3 is appropriate.
b) Can you summarize common features within each cluster?
Student answer: below we fit k-means with k = 3 so that we can examine the details within each cluster.
set.seed(130) # for reproducibility
esteem.kmeans <- esteem_data %>%
kmeans(centers = 3)
str(esteem.kmeans)
## List of 9
## $ cluster : int [1:2431] 1 1 3 1 3 3 3 2 3 3 ...
## $ centers : num [1:3, 1:10] 3.87 3.12 3.88 3.84 3.09 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:10] "Esteem87_1" "Esteem87_2" "Esteem87_3" "Esteem87_4" ...
## $ totss : num 8775
## $ withinss : num [1:3] 2139 1683 1201
## $ tot.withinss: num 5024
## $ betweenss : num 3751
## $ size : int [1:3] 784 822 825
## $ iter : int 3
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
### Summarise esteem scores by clusters
esteem_data$cluster <- esteem.kmeans$cluster
esteem_data %>% group_by(cluster) %>%
summarise(across(everything(), list(mean)))
esteem_data %>% group_by(cluster) %>%
summarise(across(everything(), list(median)))
esteem_data %>% group_by(cluster) %>%
summarise(across(everything(), list(min)))
esteem_data %>% group_by(cluster) %>%
summarise(across(everything(), list(max)))
esteem_data %>% group_by(cluster) %>%
summarise(across(everything(), list(quantile)))
## `summarise()` has grouped output by 'cluster'. You can override using the `.groups` argument.
## # A tibble: 3 x 11
## cluster Esteem87_1_1 Esteem87_2_1 Esteem87_3_1 Esteem87_4_1 Esteem87_5_1
## * <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 3.87 3.84 3.71 3.63 3.62
## 2 2 3.12 3.09 3.10 3.05 3.01
## 3 3 3.88 3.88 3.93 3.81 3.96
## # … with 5 more variables: Esteem87_6_1 <dbl>, Esteem87_7_1 <dbl>,
## # Esteem87_8_1 <dbl>, Esteem87_9_1 <dbl>, Esteem87_10_1 <dbl>
## # A tibble: 3 x 11
## cluster Esteem87_1_1 Esteem87_2_1 Esteem87_3_1 Esteem87_4_1 Esteem87_5_1
## * <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 4 4 4 4 4
## 2 2 3 3 3 3 3
## 3 3 4 4 4 4 4
## # … with 5 more variables: Esteem87_6_1 <dbl>, Esteem87_7_1 <dbl>,
## # Esteem87_8_1 <dbl>, Esteem87_9_1 <dbl>, Esteem87_10_1 <dbl>
## # A tibble: 3 x 11
## cluster Esteem87_1_1 Esteem87_2_1 Esteem87_3_1 Esteem87_4_1 Esteem87_5_1
## * <int> <dbl> <dbl> <int> <dbl> <int>
## 1 1 3 3 1 2 1
## 2 2 1 1 1 1 1
## 3 3 2 3 1 1 1
## # … with 5 more variables: Esteem87_6_1 <dbl>, Esteem87_7_1 <dbl>,
## # Esteem87_8_1 <int>, Esteem87_9_1 <int>, Esteem87_10_1 <int>
## # A tibble: 3 x 11
## cluster Esteem87_1_1 Esteem87_2_1 Esteem87_3_1 Esteem87_4_1 Esteem87_5_1
## * <int> <dbl> <dbl> <int> <dbl> <int>
## 1 1 4 4 4 4 4
## 2 2 4 4 4 4 4
## 3 3 4 4 4 4 4
## # … with 5 more variables: Esteem87_6_1 <dbl>, Esteem87_7_1 <dbl>,
## # Esteem87_8_1 <int>, Esteem87_9_1 <int>, Esteem87_10_1 <int>
## # A tibble: 15 x 11
## # Groups: cluster [3]
## cluster Esteem87_1_1 Esteem87_2_1 Esteem87_3_1 Esteem87_4_1 Esteem87_5_1
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 3 3 1 2 1
## 2 1 4 4 3 3 3
## 3 1 4 4 4 4 4
## 4 1 4 4 4 4 4
## 5 1 4 4 4 4 4
## 6 2 1 1 1 1 1
## 7 2 3 3 3 3 3
## 8 2 3 3 3 3 3
## 9 2 3 3 3 3 3
## 10 2 4 4 4 4 4
## 11 3 2 3 1 1 1
## 12 3 4 4 4 4 4
## 13 3 4 4 4 4 4
## 14 3 4 4 4 4 4
## 15 3 4 4 4 4 4
## # … with 5 more variables: Esteem87_6_1 <dbl>, Esteem87_7_1 <dbl>,
## # Esteem87_8_1 <dbl>, Esteem87_9_1 <dbl>, Esteem87_10_1 <dbl>
NLSY79$cluster <- esteem.kmeans$cluster
NLSY79 %>% group_by(cluster) %>% summarise(across(everything(), list(mean)))
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
## Warning in mean.default(col, ...): argument is not numeric or logical: returning
## NA
NLSY79 %>% group_by(cluster) %>% summarise(mean(Income87),mean(Gender),mean(AFQT),mean(MotherEd),mean(FatherEd))
## Warning in mean.default(Gender): argument is not numeric or logical: returning
## NA
## Warning in mean.default(Gender): argument is not numeric or logical: returning
## NA
## Warning in mean.default(Gender): argument is not numeric or logical: returning
## NA
## # A tibble: 3 x 47
## cluster Subject_1 Gender_1 Education05_1 Income87_1 Job05_1 Income05_1
## * <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 3598. NA 14.2 12633. NA 48429.
## 2 2 3527. NA 13.0 11555. NA 37906.
## 3 3 3393. NA 14.6 15964. NA 61818.
## # … with 40 more variables: Weight05_1 <dbl>, HeightFeet05_1 <dbl>,
## # HeightInch05_1 <dbl>, Imagazine_1 <dbl>, Inewspaper_1 <dbl>,
## # Ilibrary_1 <dbl>, MotherEd_1 <dbl>, FatherEd_1 <dbl>,
## # FamilyIncome78_1 <dbl>, Science_1 <dbl>, Arith_1 <dbl>, Word_1 <dbl>,
## # Parag_1 <dbl>, Number_1 <dbl>, Coding_1 <dbl>, Auto_1 <dbl>, Math_1 <dbl>,
## # Mechanic_1 <dbl>, Elec_1 <dbl>, AFQT_1 <dbl>, Esteem81_1_1 <dbl>,
## # Esteem81_2_1 <dbl>, Esteem81_3_1 <dbl>, Esteem81_4_1 <dbl>,
## # Esteem81_5_1 <dbl>, Esteem81_6_1 <dbl>, Esteem81_7_1 <dbl>,
## # Esteem81_8_1 <dbl>, Esteem81_9_1 <dbl>, Esteem81_10_1 <dbl>,
## # Esteem87_1_1 <dbl>, Esteem87_2_1 <dbl>, Esteem87_3_1 <dbl>,
## # Esteem87_4_1 <dbl>, Esteem87_5_1 <dbl>, Esteem87_6_1 <dbl>,
## # Esteem87_7_1 <dbl>, Esteem87_8_1 <dbl>, Esteem87_9_1 <dbl>,
## # Esteem87_10_1 <dbl>
## # A tibble: 3 x 6
## cluster `mean(Income87)` `mean(Gender)` `mean(AFQT)` `mean(MotherEd)`
## * <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 12633. NA 57.7 11.8
## 2 2 11555. NA 44.2 11.1
## 3 3 15964. NA 62.2 12.2
## # … with 1 more variable: `mean(FatherEd)` <dbl>
(Disregard the NA values: Job05 and Gender are character variables, so their means are undefined.)
Discussion:
Comparing clusters 1, 2, and 3 above, we see that cluster 2 respondents tend to have lower AFQT scores, lower mean income in 1987, and lower parental education (both parents). This group represents the cluster with lower socioeconomic status.
By contrast, cluster 3 respondents tend to have higher AFQT scores, higher mean income, and higher parental education (both parents). This group represents the cluster with higher socioeconomic status.
In addition, the PC1 and PC2 scores tend to concentrate around a common centroid within each cluster. Specifically, cluster 3 tends to have the largest PC1 and PC2 scores, and cluster 2 the smallest. What characteristics PC1 and PC2 represent is elaborated further in parts (c) and (d).
c) Can you visualize the clusters with somewhat clear boundaries? You may try different pairs of variables and different PC pairs of the esteem scores.
Student answer:
We can overlay the cluster labels on a PC1 vs PC2 plot.
K-means separates the groups largely along PC1. Pairing a variable with the PC pair of the esteem scores can generate some valuable grouping insights.
For example,
pair1: Education with PC1, PC2
data.frame(esteem = NLSY79$Education05, pc1 = pc_esteem.10$x[,c(1)], pc2 = pc_esteem.10$x[,c(2)], group = as.factor(esteem.kmeans$cluster)) %>%
ggplot(aes(x = pc1, y = pc2, col = group)) + geom_point() + ggrepel::geom_text_repel(aes(label = esteem)) + ggtitle("Clustering over PC1 and PC2")
## Warning: ggrepel: 2343 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
From this pairing, we see that people with more education tend to fall into cluster 3, which has relatively high (positive) PC1 scores and low (negative) PC2 scores, while people with less education are more likely to be grouped into cluster 1, which has relatively low (negative) PC1 scores and high (positive) PC2 scores. However, the association with cluster membership is not very strong: as seen in the plot, many people with more years of education also lie in cluster 1.
pair2: Gender with PC1, PC2
data.frame(esteem = NLSY79$Gender, pc1 = pc_esteem.10$x[,c(1)], pc2 = pc_esteem.10$x[,c(2)], group = as.factor(esteem.kmeans$cluster)) %>%
ggplot(aes(x = pc1, y = pc2, col = group)) + geom_point() + ggrepel::geom_text_repel(aes(label = esteem)) + ggtitle("Clustering over PC1 and PC2")
## Warning: ggrepel: 2391 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Here we see that women are more likely than men to fall in cluster 3 (lower, negative PC1 scores), while men are more likely to fall in cluster 1 (high PC1 scores). In addition, women and men do not differ noticeably in PC2 scores, but men generally have higher PC1 scores. This might reflect that in 1987, when the survey was carried out, women were far less empowered than today, so they reported lower levels of self-esteem.
pair3: Income with PC1, PC2
data.frame(esteem = NLSY79$Income87, pc1 = pc_esteem.10$x[,c(1)], pc2 = pc_esteem.10$x[,c(2)], group = as.factor(esteem.kmeans$cluster)) %>%
ggplot(aes(x = pc1, y = pc2, col = group)) + geom_point() + ggrepel::geom_text_repel(aes(label = esteem)) + ggtitle("Clustering over PC1 and PC2")
## Warning: ggrepel: 2387 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
From this plot, we do not find a clear relationship between income and the clustering of the esteem PC scores, in either PC1 or PC2. This suggests that income may not have a strong impact on the level of esteem.
pair4: Weight with PC1, PC2
data.frame(esteem = NLSY79$Weight05, pc1 = pc_esteem.10$x[,c(1)], pc2 = pc_esteem.10$x[,c(2)], group = as.factor(esteem.kmeans$cluster)) %>%
ggplot(aes(x = pc1, y = pc2, col = group)) + geom_point() + ggrepel::geom_text_repel(aes(label = esteem)) + ggtitle("Clustering over PC1 and PC2")
## Warning: ggrepel: 2380 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Note that the point patterns in these plots are identical, since the clustering is always displayed over PC1 and PC2; only the labels overlaid on the points (education, gender, income, weight) differ.
We find that respondents with higher weights may be more likely to fall into cluster 1 or cluster 3, which have relatively lower PC scores. This may suggest that higher weight hurts the level of self-esteem. There is no clear relationship between weight and the PC2 scores’ clustering.
We now try to find out which factors are related to self-esteem. PC1 of all the esteem scores is a good variable to summarize one’s esteem, so we take PC1 as our response variable.
Personal information: gender, education, log(income), job type, Body mass index as a measure of health (The BMI is defined as the body mass divided by the square of the body height, and is universally expressed in units of kg/m²)
Household environment: Imagazine, Inewspaper, Ilibrary, MotherEd, FatherEd, FamilyIncome78. Do set indicators Imagazine and Ilibrary as factors
Use PC1 of the ASVAB test scores as the level of intelligence
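The BMI computation described above can be checked with a tiny helper (the function name `bmi` is ours, purely illustrative; the factor 703 converts lb/in² to the metric kg/m²):

```r
bmi <- function(weight_lb, feet, inches) {
  height_in <- 12 * feet + inches   # height in inches
  703 * weight_lb / height_in^2     # BMI in kg/m^2
}
round(bmi(160, 5, 5), 1)  # a 5'5'', 160 lb person has BMI about 26.6
```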
Student Answer:
We use PC1 of the ASVAB test scores as a proxy for the level of intelligence; the ASVAB battery is an acceptable approximation of intelligence.
The preparation of the variables is as follows:
# job type: it's character
class(NLSY79$Job05)
length(unique(NLSY79$Job05))
NLSY79$Job05[NLSY79$Job05=="4700 TO 4960: Sales and Related Workers"] <-1
NLSY79$Job05[NLSY79$Job05=="10 TO 430: Executive, Administrative and Managerial Occupations"] <-2
NLSY79$Job05[NLSY79$Job05=="7900 TO 8960: Setters, Operators and Tenders"] <-3
NLSY79$Job05[NLSY79$Job05=="5000 TO 5930: Office and Administrative Support Workers"] <-4
NLSY79$Job05[NLSY79$Job05=="4200 TO 4250: Cleaning and Building Service Occupations"] <-5
NLSY79$Job05[NLSY79$Job05=="6200 TO 6940: Construction Trade and Extraction Workers"] <-6
NLSY79$Job05[NLSY79$Job05=="1300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians"] <-7
NLSY79$Job05[NLSY79$Job05=="1800 TO 1860: Social Scientists and Related Workers"] <-8
NLSY79$Job05[NLSY79$Job05=="7000 TO 7620: Installation, Maintenance and Repairs Workers"] <-9
NLSY79$Job05[NLSY79$Job05=="3000 TO 3260: Health Diagnosing and Treating Practitioners"] <-10
NLSY79$Job05[NLSY79$Job05=="1000 TO 1240: Mathematical and Computer Scientists"] <-11
NLSY79$Job05[NLSY79$Job05=="3300 TO 3650: Health Care Technical and Support Occupations"] <-12
NLSY79$Job05[NLSY79$Job05=="500 TO 950: Management Related Occupations"] <-13
NLSY79$Job05[NLSY79$Job05=="4500 TO 4650: Personal Care and Service Workers"] <-14
NLSY79$Job05[NLSY79$Job05=="2200 TO 2340: Teachers"] <-15
NLSY79$Job05[NLSY79$Job05=="2800 TO 2960: Media and Communications Workers"] <-16
NLSY79$Job05[NLSY79$Job05=="2400 TO 2550: Education, Training and Library Workers"] <-17
NLSY79$Job05[NLSY79$Job05=="2100 TO 2150: Lawyers, Judges and Legal Support Workers"] <-18
NLSY79$Job05[NLSY79$Job05=="4000 TO 4160: Food Preparation and Serving Related Occupations"]<-19
NLSY79$Job05[NLSY79$Job05=="3700 TO 3950: Protective Service Occupations"]<-20
NLSY79$Job05[NLSY79$Job05=="9000 TO 9750: Transportation and Material Moving Workers"]<-21
NLSY79$Job05[NLSY79$Job05=="2000 TO 2060: Counselors, Sociala and Religious Workers"]<-22
NLSY79$Job05[NLSY79$Job05==""]<-23 # unspecified (missing values)
NLSY79$Job05[NLSY79$Job05=="7700 TO 7750: Production and Operating Workers"]<-24
NLSY79$Job05[NLSY79$Job05=="6000 TO 6130: Farming, Fishing and Forestry Occupations"]<-25
NLSY79$Job05[NLSY79$Job05=="2600 TO 2760: Entertainers and Performers, Sports and Related Workers"]<-26
NLSY79$Job05[NLSY79$Job05=="1900 TO 1960: Life, Physical and Social Science Technicians"]<-27
NLSY79$Job05[NLSY79$Job05=="4300 TO 4430: Entertainment Attendants and Related Workers"]<-28
NLSY79$Job05[NLSY79$Job05=="1600 TO 1760: Physical Scientists"]<-29
NLSY79$Job05[NLSY79$Job05=="7800 TO 7850: Food Preparation Occupations"]<-30
NLSY79$Job05[NLSY79$Job05=="9990: Uncodeable"]<-31
NLSY79$Job05<-as.factor(NLSY79$Job05)
## we may want to treat job as a categorical variable of job types;
## however, even though we can convert this character job variable into 31 types,
## there is no clear rationale for the numeric codes assigned to them. In the
## regression models of part (b), we will not include this job variable except
## for the model with the largest number of predictors.
## [1] "character"
## [1] 31
Imagazine<-as.factor(NLSY79$Imagazine)
Ilibrary<-as.factor(NLSY79$Ilibrary)
## BMI: the body mass divided by the square of the body height, a measure of health
## BMI formula (imperial units): weight (in pounds) divided by height (in inches) squared, multiplied by 703
## someone with height 5'5'' has HeightFeet05=5 and HeightInch05=5
## 12*NLSY79$HeightFeet05+NLSY79$HeightInch05 is the height of a person in inches
BMI<-703*NLSY79$Weight05/(12*NLSY79$HeightFeet05+NLSY79$HeightInch05)^2
## the ten ASVAB subtest scores are in columns 16:25
SVABS_score<-NLSY79[,c(16:25)]
pc_SVABS.10<-prcomp(SVABS_score,scale=FALSE) ## by default, center=TRUE but scale=FALSE
pc1_SVABS<-pc_SVABS.10$x[,c(1)]
b) Run a few regression models between PC1 of all the esteem scores and the factors listed in a). Find a final best model using your own criterion.
- How did you land this model? Run a model diagnosis to see if the linear model assumptions are reasonably met.
- Write a summary of your findings. In particular, explain what and how the variables in the model affect one's self-esteem.
Student Answer:
We present the multiple regression model of pc1_esteem on all factors except job (which has 31 categories):
We will then evaluate the model choice by considering: 1. whether the regressors are significant; 2. whether the residual standard error is acceptably small (equivalently, whether the R-squared is acceptably large).
## response variable PC1 as the summary of one's esteem scores
model1_esteem<-lm(pc1_esteem~NLSY79$Gender+NLSY79$Education05+NLSY79$Income87+NLSY79$MotherEd+NLSY79$FatherEd+NLSY79$Inewspaper+NLSY79$FamilyIncome78+Imagazine+Ilibrary+BMI+pc1_SVABS)
summary(model1_esteem)
##
## Call:
## lm(formula = pc1_esteem ~ NLSY79$Gender + NLSY79$Education05 +
## NLSY79$Income87 + NLSY79$MotherEd + NLSY79$FatherEd + NLSY79$Inewspaper +
## NLSY79$FamilyIncome78 + Imagazine + Ilibrary + BMI + pc1_SVABS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.522 -0.959 -0.067 1.001 2.937
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.74e+00 2.16e-01 -8.04 1.4e-15 ***
## NLSY79$Gendermale 1.20e-01 5.28e-02 2.27 0.02339 *
## NLSY79$Education05 7.61e-02 1.23e-02 6.21 6.1e-10 ***
## NLSY79$Income87 1.29e-05 2.33e-06 5.52 3.7e-08 ***
## NLSY79$MotherEd 1.77e-02 1.28e-02 1.38 0.16667
## NLSY79$FatherEd 4.17e-03 9.52e-03 0.44 0.66159
## NLSY79$Inewspaper 1.62e-01 7.93e-02 2.05 0.04091 *
## NLSY79$FamilyIncome78 1.44e-06 2.00e-06 0.72 0.47221
## Imagazine1 5.19e-02 6.15e-02 0.84 0.39887
## Ilibrary1 9.16e-02 6.30e-02 1.46 0.14575
## BMI -3.15e-03 3.82e-03 -0.82 0.40978
## pc1_SVABS 6.32e-03 1.63e-03 3.88 0.00011 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.23 on 2419 degrees of freedom
## Multiple R-squared: 0.106, Adjusted R-squared: 0.102
## F-statistic: 26.1 on 11 and 2419 DF, p-value: <2e-16
From the results above, the R-squared is quite low (about 0.106), so the model explains only a small share of the variation in esteem scores. The overall F-test p-value, however, is minimal, so the predictors jointly have a significant relationship with the response.
Looking at the individual coefficients, father's education, family income in 78, and BMI are clearly not significant, while mother's education and possession of a library card are only mildly significant.
It also seems that, except for BMI, all these variables are positively associated with the PC1 of the 10 esteem scores.
So we make some modifications: we drop the clearly insignificant variables and consider a restricted model:
## restricted model1 after dropping the obviously insignificant variables
model1_esteem_restricted<-lm(pc1_esteem~NLSY79$Gender+NLSY79$Education05+NLSY79$Income87+NLSY79$MotherEd+NLSY79$Inewspaper+Imagazine+Ilibrary+pc1_SVABS)
summary(model1_esteem_restricted)
##
## Call:
## lm(formula = pc1_esteem ~ NLSY79$Gender + NLSY79$Education05 +
## NLSY79$Income87 + NLSY79$MotherEd + NLSY79$Inewspaper + Imagazine +
## Ilibrary + pc1_SVABS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.520 -0.957 -0.070 1.008 2.919
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.84e+00 1.84e-01 -9.96 < 2e-16 ***
## NLSY79$Gendermale 1.19e-01 5.25e-02 2.27 0.023 *
## NLSY79$Education05 7.77e-02 1.21e-02 6.42 1.7e-10 ***
## NLSY79$Income87 1.30e-05 2.31e-06 5.65 1.8e-08 ***
## NLSY79$MotherEd 2.20e-02 1.13e-02 1.95 0.051 .
## NLSY79$Inewspaper 1.69e-01 7.87e-02 2.15 0.032 *
## Imagazine1 6.20e-02 6.08e-02 1.02 0.308
## Ilibrary1 9.50e-02 6.26e-02 1.52 0.129
## pc1_SVABS 6.54e-03 1.62e-03 4.05 5.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.23 on 2422 degrees of freedom
## Multiple R-squared: 0.105, Adjusted R-squared: 0.102
## F-statistic: 35.7 on 8 and 2422 DF, p-value: <2e-16
Comparing the R-squared and p-values, we find that except for Imagazine (whose significance decreased after we imposed the restrictions), all the other variables remain significant, notably the person's own income and education, which intuitively makes sense. However, the restricted model does not substantially improve the overall fit.
Still, with model1 and model1_restricted we can already tell that a person's own characteristics are far more strongly associated with self-esteem than the family-upbringing variables.
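Because model1_esteem_restricted is nested in model1_esteem, this comparison can also be made formally; a sketch, assuming both fitted objects from above are still in the workspace:

```r
# F-test of H0: the dropped coefficients (FatherEd, FamilyIncome78, BMI)
# are all zero. A large p-value supports the simpler restricted model.
anova(model1_esteem_restricted, model1_esteem)

# AIC trades goodness of fit against model complexity; smaller is better.
AIC(model1_esteem_restricted, model1_esteem)
```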
Alternatively, we may consider an interaction between the parents' education variables. This could be helpful since together they reflect the family upbringing of individuals.
parental_educ<-NLSY79$MotherEd*NLSY79$FatherEd
model2_esteem<-lm(pc1_esteem~NLSY79$Gender+NLSY79$Education05+NLSY79$Income87+parental_educ+NLSY79$Inewspaper+NLSY79$FamilyIncome78+Imagazine+Ilibrary+BMI+pc1_SVABS)
summary(model2_esteem)
##
## Call:
## lm(formula = pc1_esteem ~ NLSY79$Gender + NLSY79$Education05 +
## NLSY79$Income87 + parental_educ + NLSY79$Inewspaper + NLSY79$FamilyIncome78 +
## Imagazine + Ilibrary + BMI + pc1_SVABS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.530 -0.952 -0.061 1.009 2.937
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.60e+00 2.05e-01 -7.84 6.7e-15 ***
## NLSY79$Gendermale 1.20e-01 5.28e-02 2.28 0.023 *
## NLSY79$Education05 7.74e-02 1.23e-02 6.27 4.2e-10 ***
## NLSY79$Income87 1.29e-05 2.33e-06 5.54 3.4e-08 ***
## parental_educ 6.06e-04 4.98e-04 1.22 0.224
## NLSY79$Inewspaper 1.75e-01 7.87e-02 2.22 0.026 *
## NLSY79$FamilyIncome78 1.58e-06 2.00e-06 0.79 0.430
## Imagazine1 5.69e-02 6.14e-02 0.93 0.354
## Ilibrary1 9.61e-02 6.29e-02 1.53 0.127
## BMI -3.29e-03 3.82e-03 -0.86 0.389
## pc1_SVABS 6.47e-03 1.62e-03 3.98 7.1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.23 on 2420 degrees of freedom
## Multiple R-squared: 0.105, Adjusted R-squared: 0.102
## F-statistic: 28.5 on 10 and 2420 DF, p-value: <2e-16
Model2 combines the two parental-education variables into one term, but this term is still not significant (p = 0.224), and it does not improve the significance of the other variables either.
Lastly, we try the most complicated model, which adds the job categories coded in part 6.(a):
## response variable PC1 as the summary of one's esteem scores
model3_esteem<-lm(pc1_esteem~NLSY79$Gender+NLSY79$Education05+NLSY79$Income87+NLSY79$MotherEd+NLSY79$FatherEd+NLSY79$Inewspaper+NLSY79$FamilyIncome78+Imagazine+Ilibrary+BMI+pc1_SVABS+NLSY79$Job05)
summary(model3_esteem)
Anova(model3_esteem)
##
## Call:
## lm(formula = pc1_esteem ~ NLSY79$Gender + NLSY79$Education05 +
## NLSY79$Income87 + NLSY79$MotherEd + NLSY79$FatherEd + NLSY79$Inewspaper +
## NLSY79$FamilyIncome78 + Imagazine + Ilibrary + BMI + pc1_SVABS +
## NLSY79$Job05)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.204 -0.904 -0.042 0.984 3.142
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.41e+00 2.47e-01 -5.73 1.1e-08 ***
## NLSY79$Gendermale 1.63e-01 6.06e-02 2.69 0.0072 **
## NLSY79$Education05 6.11e-02 1.36e-02 4.49 7.6e-06 ***
## NLSY79$Income87 1.15e-05 2.34e-06 4.89 1.1e-06 ***
## NLSY79$MotherEd 1.59e-02 1.28e-02 1.24 0.2141
## NLSY79$FatherEd 3.01e-03 9.49e-03 0.32 0.7512
## NLSY79$Inewspaper 1.82e-01 7.94e-02 2.29 0.0220 *
## NLSY79$FamilyIncome78 1.08e-06 1.99e-06 0.54 0.5872
## Imagazine1 2.19e-02 6.13e-02 0.36 0.7206
## Ilibrary1 7.79e-02 6.28e-02 1.24 0.2149
## BMI -4.48e-03 3.81e-03 -1.17 0.2402
## pc1_SVABS 5.22e-03 1.65e-03 3.17 0.0015 **
## NLSY79$Job0510 2.10e-01 1.70e-01 1.23 0.2174
## NLSY79$Job0511 2.13e-01 1.76e-01 1.21 0.2266
## NLSY79$Job0512 -3.58e-01 1.52e-01 -2.36 0.0183 *
## NLSY79$Job0513 2.88e-01 1.46e-01 1.97 0.0494 *
## NLSY79$Job0514 2.63e-02 2.08e-01 0.13 0.8994
## NLSY79$Job0515 -2.07e-02 1.48e-01 -0.14 0.8887
## NLSY79$Job0516 5.05e-02 3.50e-01 0.14 0.8851
## NLSY79$Job0517 -4.30e-02 2.44e-01 -0.18 0.8602
## NLSY79$Job0518 1.58e-02 3.29e-01 0.05 0.9617
## NLSY79$Job0519 -3.58e-01 1.73e-01 -2.07 0.0383 *
## NLSY79$Job052 1.67e-01 1.07e-01 1.56 0.1182
## NLSY79$Job0520 4.17e-01 1.87e-01 2.23 0.0261 *
## NLSY79$Job0521 -3.39e-01 1.44e-01 -2.35 0.0190 *
## NLSY79$Job0522 -1.01e-01 2.12e-01 -0.48 0.6336
## NLSY79$Job0523 -2.08e-01 1.85e-01 -1.13 0.2593
## NLSY79$Job0524 -6.54e-02 1.95e-01 -0.34 0.7376
## NLSY79$Job0525 -4.10e-01 4.19e-01 -0.98 0.3278
## NLSY79$Job0526 4.86e-01 2.64e-01 1.84 0.0657 .
## NLSY79$Job0527 -2.52e-02 4.69e-01 -0.05 0.9571
## NLSY79$Job0528 -9.65e-01 3.96e-01 -2.44 0.0149 *
## NLSY79$Job0529 -1.20e+00 6.17e-01 -1.95 0.0517 .
## NLSY79$Job053 -8.27e-02 1.46e-01 -0.57 0.5708
## NLSY79$Job0530 -1.96e-01 6.16e-01 -0.32 0.7504
## NLSY79$Job0531 -4.04e-01 1.22e+00 -0.33 0.7409
## NLSY79$Job054 6.52e-02 1.09e-01 0.60 0.5501
## NLSY79$Job055 -4.02e-01 1.75e-01 -2.30 0.0216 *
## NLSY79$Job056 -2.21e-01 1.40e-01 -1.58 0.1132
## NLSY79$Job057 -2.76e-01 1.89e-01 -1.46 0.1438
## NLSY79$Job058 -5.20e-01 5.07e-01 -1.03 0.3052
## NLSY79$Job059 -1.07e-01 1.48e-01 -0.72 0.4693
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.22 on 2389 degrees of freedom
## Multiple R-squared: 0.133, Adjusted R-squared: 0.118
## F-statistic: 8.9 on 41 and 2389 DF, p-value: <2e-16
##
## Anova Table (Type II tests)
##
## Response: pc1_esteem
## Sum Sq Df F value Pr(>F)
## NLSY79$Gender 11 1 7.23 0.0072 **
## NLSY79$Education05 30 1 20.12 7.6e-06 ***
## NLSY79$Income87 36 1 23.95 1.1e-06 ***
## NLSY79$MotherEd 2 1 1.54 0.2141
## NLSY79$FatherEd 0 1 0.10 0.7512
## NLSY79$Inewspaper 8 1 5.25 0.0220 *
## NLSY79$FamilyIncome78 0 1 0.29 0.5872
## Imagazine 0 1 0.13 0.7206
## Ilibrary 2 1 1.54 0.2149
## BMI 2 1 1.38 0.2402
## pc1_SVABS 15 1 10.06 0.0015 **
## NLSY79$Job05 109 30 2.44 2.2e-05 ***
## Residuals 3548 2389
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This model has one advantage: it allows us to examine the association between occupation and self-esteem. The Type II ANOVA shows that Job05 as a whole is significant (p = 2.2e-05), even though most of the individual job-category coefficients are not. Thus job category has a significant overall effect, but we would need more data to pin down the effect of each specific category.
In addition, although this model has a slightly higher multiple R-squared, it adds 30 extra parameters, most of them insignificant. On parsimony grounds it is not preferred.
In summary, across model1, model1_restricted, model2, and model3, we prefer model1_restricted as described above.
Overall, being male, years of education, income, parental education, newspaper reading, magazine and library access, and cognitive ability (PC1 of the ASVAB scores) are all positively associated with self-esteem.
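Part (b) also asks for a model diagnosis. A minimal sketch of the standard residual checks for the preferred model, assuming model1_esteem_restricted is still in the workspace:

```r
# Residuals vs. fitted checks linearity and constant variance;
# the normal QQ-plot checks approximate normality of the residuals.
par(mfrow = c(1, 2))
plot(model1_esteem_restricted, which = 1)
plot(model1_esteem_restricted, which = 2)
par(mfrow = c(1, 1))
```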
The Cancer Genome Atlas (TCGA), a landmark cancer genomics program by the National Cancer Institute (NCI), molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. The genome data is open to the public through the Genomic Data Commons Data Portal (GDC).
In this study, we focus on 4 sub-types of breast cancer (BRCA): basal-like (basal), Luminal A-like (lumA), Luminal B-like (lumB), HER2-enriched. The sub-type is based on PAM50, a clinical-grade luminal-basal classifier.
We will try to use the mRNA expression data alone, without the labels, to recover the 4 sub-types. Classification without labels, or prediction without outcomes, is called unsupervised learning. We will use K-means and spectral clustering on the mRNA data and see whether the sub-types can be separated.
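K-means is run below, but spectral clustering does not appear explicitly; one way to try it is kernlab::specc(). A sketch under the assumption that the kernlab package is installed and that the PC scores pca_ct$x computed later in this section are available (running it on a few PCs keeps the kernel computation affordable):

```r
library(kernlab)

# Spectral clustering with a Gaussian (RBF) kernel on the top 4 PCs.
sc <- specc(pca_ct$x[, 1:4], centers = 4)

# specc objects carry the cluster assignment in the .Data slot.
table(brca_subtype$BRCA_Subtype_PAM50, sc@.Data)
```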
We first read the data using data.table::fread() which is a faster way to read in big data than read.csv().
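A sketch of that read-in step; the file name is an assumption based on this project's data folder, not shown in the original code:

```r
library(data.table)

# fread() parses in parallel and is much faster than read.csv()
# on a wide expression matrix. (Hypothetical file name.)
brca_raw <- fread("data/brca.csv")
```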
Student Answer: 208 patients for Basal sub-type, 91 for Her2, 628 for LumA, and 233 for LumB.
brca_subtype %>%
group_by(BRCA_Subtype_PAM50) %>%
summarise(n())
## # A tibble: 4 x 2
## BRCA_Subtype_PAM50 `n()`
## * <chr> <int>
## 1 Basal 208
## 2 Her2 91
## 3 LumA 628
## 4 LumB 233
set.seed(2048)
n_gene <- 5
sample_idx <- sample(ncol(brca), n_gene)
plot_data <- select(brca, all_of(sample_idx))
plot_data$TYPE <- brca_subtype$BRCA_Subtype_PAM50
plot_data %>%
pivot_longer(cols = 1:n_gene) %>%
group_by(TYPE, name) %>%
ggplot(aes(x = value)) +
geom_histogram(aes(y=..density.., fill = name)) +
facet_grid(name ~ TYPE, scales = "free") +
labs(x = 'Value', y = 'Density') +
theme_bw() +
theme(legend.position = "none")

sel_cols <- which((colSums(abs(brca)) != 0) & (sapply(brca, var) > 1e-10))
brca_sub <- brca[, sel_cols, with=F]
brca_sub <- log2(brca_sub + 1e-10)
We then run K-means with 4 centers and cross-tabulate brca_subtype against the cluster labels.
brca_sub_kmeans <- kmeans(x = brca_sub, centers = 4, nstart = 50)
table(brca_subtype$BRCA_Subtype_PAM50, brca_sub_kmeans$cluster)
##
## 1 2 3 4
## Basal 17 1 187 3
## Her2 9 43 16 23
## LumA 86 395 0 147
## LumB 22 104 2 105
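One way to summarize a confusion table like this with a single number is cluster purity: the share of samples that land in their cluster's majority subtype. A small base-R helper (not part of the original analysis):

```r
# For each cluster, count its most common true label; purity is the
# total of those counts divided by the sample size (1 = perfect).
cluster_purity <- function(truth, cluster) {
  tab <- table(truth, cluster)
  sum(apply(tab, 2, max)) / sum(tab)
}

cluster_purity(c("a", "a", "b", "b"), c(1, 1, 2, 2))  # two clean clusters -> 1
```

Applied to the table above, the majority counts are 86, 395, 187, and 147, giving purity 815/1160, or about 0.70.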
(For a matrix this large, irlba::irlba() can compute only the leading PCs efficiently.)
Student Answer: The top 4 PCs seem to explain much more variance than the rest, with a clear elbow at 4 PCs. Thus 4 PCs is a reasonable choice, although we might also want to take into account the performance of the clustering.
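For a matrix with roughly 20,000 columns, prcomp() computes every PC; when only a handful are needed, the irlba package computes just the leading ones via a truncated SVD. A sketch, assuming irlba is installed and brca_sub from above is available:

```r
library(irlba)

# Compute only the first 4 PCs; far cheaper than a full prcomp().
pca_fast <- prcomp_irlba(as.matrix(brca_sub), n = 4, center = TRUE, scale. = FALSE)
head(pca_fast$x)  # scores on PC1-PC4
```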
pca_ct_sc <- prcomp(brca_sub, center = T, scale. = T)
pca_ct <- prcomp(brca_sub, center = T, scale. = F)
pve <- summary(pca_ct_sc)$importance[2, 1:10]
data.frame("index"=1:10, "pve"=pve) %>%
ggplot(aes(x = index, y = pve)) +
geom_point() + geom_line() +
labs(title = "PVE Centered and Scaled", x = '# of PCs', y = 'Proportion of Variance Explained') +
scale_x_continuous(breaks = 1:10) +
theme_bw()
(Tip: use gridExtra::grid.arrange(), ggpubr::ggarrange(), or egg::ggarrange() for ggplots; use fig.show="hold" as a chunk option for base plots.)
Student Answer: We should NOT scale for clustering purposes: as the two plots below show, the unscaled PCs produce much better separation of the cancer types (especially between the Basal and Her2 subtypes).
plot_1 <-
data.table(x = pca_ct_sc$x[, 1],
y = pca_ct_sc$x[, 2],
label = as.factor(brca_subtype$BRCA_Subtype_PAM50)) %>%
ggplot() +
geom_point(aes(x = x, y = y, col = label)) +
labs(title = "Centered and Scaled", x = 'PC 1', y = 'PC 2')
plot_2 <-
data.table(x = pca_ct$x[, 1],
y = pca_ct$x[, 2],
label = as.factor(brca_subtype$BRCA_Subtype_PAM50),
cluster = as.factor(brca_sub_kmeans$cluster)) %>%
ggplot() +
geom_point(aes(x = x, y = y, col = label)) +
labs(title = "Centered and Unscaled", x = 'PC 1', y = 'PC 2')
grid.arrange(plot_1, plot_2, ncol = 2)
Student Answer: K = 4 is a reasonable number of clusters according to the elbow rule (although in our particular case the location of the "elbow" is slightly ambiguous).
wss <- function(df, k)
{
kmeans(df, k, nstart = 10)$tot.withinss
}
num_k <- 2:15
wss_values <- map_dbl(num_k, function(k) wss(pca_ct$x[, 1:4], k))
data.frame("k" = num_k, "wss" = wss_values) %>%
ggplot(aes(x = k, y = wss)) +
geom_point() + geom_line() +
labs(x = '# of Clusters K', y = 'WSS')
Student Answer: The clustering algorithm can successfully distinguish the Basal subtype from LumA/B. A large portion of the LumA subtype is also isolated from the rest. However, the Her2 subtype cannot be well separated, and many LumA/B samples are mislabeled.
n_cluster <- 4
n_PCs <- 4
pca_ct_kmeans <- kmeans(x = pca_ct$x[, 1:n_PCs], centers = n_cluster, nstart = 100)
centroids <- data.frame(pca_ct_kmeans$centers[, 1:2])
centroids$cluster = as.factor(1:n_cluster)
plot_data <- data.table(x = pca_ct$x[, 1],
y = pca_ct$x[, 2],
label = as.factor(brca_subtype$BRCA_Subtype_PAM50),
cluster = as.factor(pca_ct_kmeans$cluster))
ggplot() +
geom_point(data = plot_data, aes(x = x, y = y, col = label, shape = cluster)) +
geom_point(data = centroids, aes(x = PC1, y = PC2, shape = cluster)) +
labs(title = "Centered and Unscaled", x = 'PC 1', y = 'PC 2')
Student Answer: In this case, PCA did NOT help the quality of the k-means clusters. The only improvement is that the kmeans algorithm runs much faster with only 4 PCs. In general, PCA can help clustering since it picks out the more important dimensions and thus reduces noise.
plot_1 <-
ggplot() +
geom_point(data = plot_data, aes(x = x, y = y, col = label, shape = cluster)) +
geom_point(data = centroids, aes(x = PC1, y = PC2, shape = cluster)) +
labs(title = "kmeans with 4 PCs", x = 'PC 1', y = 'PC 2')
plot_2 <-
data.table(x = pca_ct$x[, 1],
y = pca_ct$x[, 2],
label = as.factor(brca_subtype$BRCA_Subtype_PAM50),
cluster = as.factor(brca_sub_kmeans$cluster)) %>%
ggplot() +
geom_point(aes(x = x, y = y, col = label, shape = cluster)) +
labs(title = "kmeans with Original Data", x = 'PC 1', y = 'PC 2')
grid.arrange(plot_1, plot_2, ncol = 2)

print("Kmeans with Original Data")
## [1] "Kmeans with Original Data"
table(brca_subtype$BRCA_Subtype_PAM50, brca_sub_kmeans$cluster)
##
## 1 2 3 4
## Basal 17 1 187 3
## Her2 9 43 16 23
## LumA 86 395 0 147
## LumB 22 104 2 105
print("Kmeans with 4PCs")
## [1] "Kmeans with 4PCs"
table(brca_subtype$BRCA_Subtype_PAM50, pca_ct_kmeans$cluster)
##
## 1 2 3 4
## Basal 187 1 17 3
## Her2 28 27 9 27
## LumA 0 376 91 161
## LumB 2 99 22 110
Student Answer: This patient is most likely to have the Basal subtype. The Euclidean distances to the four centroids are 122, 496, 508, and 469, respectively.
x_patient <- fread("data/brca_x_patient.csv")
x_patient <- log2(x_patient[, sel_cols, with=F] + 1e-10) - pca_ct$center
patient = data.frame("PC1" = as.matrix(x_patient[1, ]) %*% as.matrix(pca_ct$rotation[, 1]),
"PC2" = as.matrix(x_patient[1, ]) %*% as.matrix(pca_ct$rotation[, 2]))
ggplot() +
geom_point(data = plot_data, aes(x = x, y = y, col = label, shape = cluster)) +
geom_point(data = centroids, aes(x = PC1, y = PC2, shape = cluster)) +
geom_point(data = patient, aes(x = PC1, y = PC2), shape = 21, size = 4, fill = 'red') +
ggrepel::geom_label_repel(data = patient, aes(x = PC1, y = PC2), label = "Patient", size = 4) +
labs(title = "Centered and Unscaled, with New Patient", x = 'PC 1', y = 'PC 2')

for (idx in 1:4)
{
print(norm(as.matrix(centroids[idx, 1:2]) - as.matrix(patient[1, ])))
}
## [1] 122
## [1] 496
## [1] 508
## [1] 469
This question utilizes the Auto dataset from ISLR. The original dataset contains 408 observations about cars. It is similar to the CARS dataset that we use in our lectures. To get the data, first install the package ISLR; the Auto dataset should then load automatically. We'll use this dataset to practice the methods learned so far. The original data source is here: https://archive.ics.uci.edu/ml/datasets/auto+mpg
Get familiar with this dataset first. Tip: you can use the command ?ISLR::Auto to view a description of the dataset.
library(ISLR)
?ISLR::Auto
auto_data <- ISLR::Auto
auto_data$origin <- as.factor(auto_data$origin)
Explore the data, with particular focus on pairwise plots and summary statistics. Briefly summarize your findings and any peculiarities in the data.
Student Answer:
Out of the nine columns, only the column name and origin should contain factor type data; the other seven columns all contain numeric type data.
The distribution for the response variable, mpg, is right-skewed. Five out of the seven numeric variables’ distributions are right skewed.
There are 301 names for 392 vehicles – almost every vehicle has a unique name.
While the range of number of cylinders for this dataset is 3 to 8, there is no vehicle with 7 cylinders.
cylinders, displacement, horsepower, and weight show strong positive correlation with each other (all pairwise correlations among the four variables are larger than 0.7).
str(auto_data)## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
summary(auto_data)## mpg cylinders displacement horsepower weight
## Min. : 9.0 Min. :3.00 Min. : 68 Min. : 46.0 Min. :1613
## 1st Qu.:17.0 1st Qu.:4.00 1st Qu.:105 1st Qu.: 75.0 1st Qu.:2225
## Median :22.8 Median :4.00 Median :151 Median : 93.5 Median :2804
## Mean :23.4 Mean :5.47 Mean :194 Mean :104.5 Mean :2978
## 3rd Qu.:29.0 3rd Qu.:8.00 3rd Qu.:276 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.6 Max. :8.00 Max. :455 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.0 Min. :70 1:245 amc matador : 5
## 1st Qu.:13.8 1st Qu.:73 2: 68 ford pinto : 5
## Median :15.5 Median :76 3: 79 toyota corolla : 5
## Mean :15.5 Mean :76 amc gremlin : 4
## 3rd Qu.:17.0 3rd Qu.:79 amc hornet : 4
## Max. :24.8 Max. :82 chevrolet chevette: 4
## (Other) :365
length(unique(auto_data$name))
## [1] 301
unique(auto_data$cylinders)
## [1] 8 4 6 3 5
library(skimr)
skim(auto_data)
| Name | auto_data |
| Number of rows | 392 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| origin | 0 | 1 | FALSE | 3 | 1: 245, 3: 79, 2: 68 |
| name | 0 | 1 | FALSE | 301 | amc: 5, for: 5, toy: 5, amc: 4 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| mpg | 0 | 1 | 23.45 | 7.81 | 9 | 17.0 | 22.8 | 29 | 46.6 | ▆▇▆▃▁ |
| cylinders | 0 | 1 | 5.47 | 1.71 | 3 | 4.0 | 4.0 | 8 | 8.0 | ▇▁▃▁▅ |
| displacement | 0 | 1 | 194.41 | 104.64 | 68 | 105.0 | 151.0 | 276 | 455.0 | ▇▂▂▃▁ |
| horsepower | 0 | 1 | 104.47 | 38.49 | 46 | 75.0 | 93.5 | 126 | 230.0 | ▆▇▃▁▁ |
| weight | 0 | 1 | 2977.58 | 849.40 | 1613 | 2225.2 | 2803.5 | 3615 | 5140.0 | ▇▇▅▅▂ |
| acceleration | 0 | 1 | 15.54 | 2.76 | 8 | 13.8 | 15.5 | 17 | 24.8 | ▁▆▇▂▁ |
| year | 0 | 1 | 75.98 | 3.68 | 70 | 73.0 | 76.0 | 79 | 82.0 | ▇▆▇▆▇ |
hist(auto_data$mpg)

auto_data %>% select_if(is.numeric) %>% cor()
## mpg cylinders displacement horsepower weight acceleration
## mpg 1.000 -0.778 -0.805 -0.778 -0.832 0.423
## cylinders -0.778 1.000 0.951 0.843 0.898 -0.505
## displacement -0.805 0.951 1.000 0.897 0.933 -0.544
## horsepower -0.778 0.843 0.897 1.000 0.865 -0.689
## weight -0.832 0.898 0.933 0.865 1.000 -0.417
## acceleration 0.423 -0.505 -0.544 -0.689 -0.417 1.000
## year 0.581 -0.346 -0.370 -0.416 -0.309 0.290
## year
## mpg 0.581
## cylinders -0.346
## displacement -0.370
## horsepower -0.416
## weight -0.309
## acceleration 0.290
## year 1.000
library(reshape2)
##
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
##
## dcast, melt
## The following object is masked from 'package:tidyr':
##
## smiths
corr.table <- auto_data %>% select_if(is.numeric) %>% cor() %>% reshape2::melt()
corr.table %>% ggplot(aes(x=Var1, y=Var2, fill=value)) + geom_tile() +
xlab("") +
ylab("") +
guides(fill = guide_legend(title = "Correlation"))

library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
auto_data %>% select_if(is.numeric) %>% ggpairs()
What effect does time have on MPG?
- Run a simple regression of mpg vs. year and report R's summary output. Is year a significant variable at the .05 level? State what effect year has on mpg, if any, according to this model.
Student Answer:
After running the regression of mpg vs. year, the summary result is below. Yes, year is a significant variable at the .05 level.
Interpretation: according to this model, when the model year (modulo 100) increases by 1 (one year), the vehicle's miles per gallon is expected to increase by 1.23 on average.
lm_auto1 <- lm(mpg ~ year, data = auto_data)
summary(lm_auto1)
##
## Call:
## lm(formula = mpg ~ year, data = auto_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.021 -5.441 -0.441 4.974 18.209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -70.0117 6.6452 -10.5 <2e-16 ***
## year 1.2300 0.0874 14.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.36 on 390 degrees of freedom
## Multiple R-squared: 0.337, Adjusted R-squared: 0.335
## F-statistic: 198 on 1 and 390 DF, p-value: <2e-16
- Add horsepower on top of the variable year to your linear model. Is year still a significant variable at the .05 level? Give a precise interpretation of the year's effect found here.
Student Answer: Yes, year is still a significant variable at the .05 level.
Interpretation: if the horsepower of the vehicle is held constant, then when the model year (modulo 100) increases by 1 (one year), the vehicle's miles per gallon is expected to increase by about 0.66 on average.
lm_auto2 <- lm(mpg ~ year+horsepower, data = auto_data)
summary(lm_auto2)
##
## Call:
## lm(formula = mpg ~ year + horsepower, data = auto_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.077 -3.078 -0.431 2.588 15.315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.73917 5.34903 -2.38 0.018 *
## year 0.65727 0.06626 9.92 <2e-16 ***
## horsepower -0.13165 0.00634 -20.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.39 on 389 degrees of freedom
## Multiple R-squared: 0.685, Adjusted R-squared: 0.684
## F-statistic: 424 on 2 and 389 DF, p-value: <2e-16
Student Answer:
The difference: The difference lies in whether the horsepower variable is held constant or not when estimating the 95% CI for the coefficient of year.
The 95% CI in (i) estimates the coefficient of year without accounting for any other variables: regardless of how other variables change, the 95% CI for the coefficient of year is (1.06, 1.40). (Note: this year effect is not purely from year; it also absorbs the effects of correlated variables.)
On the other hand, the 95% CI in (ii) estimates the coefficient of year with the horsepower variable held constant: when the horsepower of the vehicles is held fixed, the 95% CI for the coefficient of year is (0.527, 0.788).
confint(lm_auto1,level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -83.08 -56.9
## year 1.06 1.4
confint(lm_auto2,level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -23.256 -2.223
## year 0.527 0.788
## horsepower -0.144 -0.119
- Try the model lm(mpg ~ year * horsepower). Is the interaction effect significant at the .05 level? Explain the year effect (if any).
Student Answer: With a p-value < 2e-16 for the interaction term, the interaction effect is significant at the .05 level.
Year effect: with the interaction, the effect of year depends on horsepower. For a vehicle with a given horsepower, a one-year increase in model year changes mpg by about 2.19 - 0.016 * horsepower on average, so the yearly mpg gain is large for low-horsepower cars and shrinks toward zero as horsepower approaches roughly 137.
lm_auto3 <- lm(mpg ~ year * horsepower, data = auto_data)
summary(lm_auto3)
anova(lm_auto2,lm_auto3)
##
## Call:
## lm(formula = mpg ~ year * horsepower, data = auto_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.349 -2.451 -0.456 2.406 14.444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.27e+02 1.21e+01 -10.45 <2e-16 ***
## year 2.19e+00 1.61e-01 13.59 <2e-16 ***
## horsepower 1.05e+00 1.15e-01 9.06 <2e-16 ***
## year:horsepower -1.60e-02 1.56e-03 -10.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.9 on 388 degrees of freedom
## Multiple R-squared: 0.752, Adjusted R-squared: 0.75
## F-statistic: 393 on 3 and 388 DF, p-value: <2e-16
##
## Analysis of Variance Table
##
## Model 1: mpg ~ year + horsepower
## Model 2: mpg ~ year * horsepower
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 389 7491
## 2 388 5903 1 1588 104 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
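Because of the interaction, the slope of year is not a single number; it is b_year + b_interaction * horsepower. A small sketch using the rounded coefficients from the summary above:

```r
# d(mpg)/d(year) at a given horsepower, with coefficients rounded
# from the summary above (2.19 and -0.016).
year_slope <- function(hp, b_year = 2.19, b_int = -0.016) {
  b_year + b_int * hp
}

year_slope(c(75, 104, 150))  # low, average, high horsepower
# about 0.99, 0.53, -0.21: the yearly mpg gain shrinks as horsepower grows
```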
Remember that the same variable can play different roles! Take a quick look at the variable cylinders, and try to use this variable in the following analyses wisely. We all agree that a larger number of cylinders will lower mpg. However, we can interpret cylinders as either a continuous (numeric) variable or a categorical variable.
- Fit a model of mpg vs. cylinders, treating cylinders as a continuous/numeric variable. Is cylinders significant at the 0.01 level? What effect does cylinders play in this model?
Student Answer: Yes, with a coefficient p-value < 2e-16, cylinders is significant at the 0.01 level.
Effect: when the number of cylinders increases by 1, the vehicle's miles per gallon is expected to decrease by 3.558 on average.
lm_auto4 <- lm(mpg ~ cylinders, data = auto_data)
summary(lm_auto4)
##
## Call:
## lm(formula = mpg ~ cylinders, data = auto_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.241 -3.183 -0.633 2.549 17.917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.916 0.835 51.4 <2e-16 ***
## cylinders -3.558 0.146 -24.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.91 on 390 degrees of freedom
## Multiple R-squared: 0.605, Adjusted R-squared: 0.604
## F-statistic: 597 on 1 and 390 DF, p-value: <2e-16
- Fit a model of mpg vs. cylinders, treating cylinders as a categorical/factor variable. Is cylinders significant at the .01 level? What is the effect of cylinders in this model? Describe the cylinders effect over mpg.
Student Answer:
Only the 4-cylinder level (cylinders4, p-value 0.00027) is significant at the 0.01 level; the levels for 5, 6, and 8 cylinders are not. The effect of cylinders differs across the number of cylinders.
Effect: In this model, in which we only consider number of cylinders as (categorical) variables, when the number of cylinders for a vehicle is 4, compared with when the number of cylinders is 3, this vehicle’s miles per gallon on average will be 8.734 higher in value.
When the number of cylinders for a vehicle is 5, compared with when the number of cylinders is 3, this vehicle’s miles per gallon on average will be 6.817 higher in value.
When the number of cylinders for a vehicle is 6, compared with when the number of cylinders is 3, this vehicle’s miles per gallon on average will be 0.577 lower in value.
When the number of cylinders for a vehicle is 8, compared with when the number of cylinders is 3, this vehicle’s miles per gallon on average will be 5.587 lower in value.
lm_auto5 <- lm(mpg ~ as.factor(cylinders), data = auto_data)
summary(lm_auto5)
##
## Call:
## lm(formula = mpg ~ as.factor(cylinders), data = auto_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.284 -2.904 -0.963 2.344 18.027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.550 2.349 8.75 < 2e-16 ***
## as.factor(cylinders)4 8.734 2.373 3.68 0.00027 ***
## as.factor(cylinders)5 6.817 3.589 1.90 0.05825 .
## as.factor(cylinders)6 -0.577 2.405 -0.24 0.81071
## as.factor(cylinders)8 -5.587 2.395 -2.33 0.02015 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.7 on 387 degrees of freedom
## Multiple R-squared: 0.641, Adjusted R-squared: 0.638
## F-statistic: 173 on 4 and 387 DF, p-value: <2e-16
- What are the fundamental differences between treating cylinders as a continuous variable and as a categorical variable in your models?
Student Answer:
When treating cylinders as a continuous variable, there is only one cylinders term in the model; when treating it as a categorical variable, each factor level (each number of cylinders) other than the base level (cylinders = 3) becomes its own dummy variable.
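The dummy coding can be seen directly from the design matrix; a small self-contained example:

```r
# model.matrix() expands a factor into indicator columns, with the first
# level (3 cylinders) absorbed into the intercept as the baseline.
cyl <- factor(c(3, 4, 6, 8))
model.matrix(~ cyl)
#   (Intercept) cyl4 cyl6 cyl8
# 1           1    0    0    0
# 2           1    1    0    0
# 3           1    0    1    0
# 4           1    0    0    1
```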
- Can you perform a test of fit0: mpg is linear in cylinders vs. fit1: mpg relates to cylinders as a categorical variable, at the .01 level?
Student Answer:
Using anova() to compare the two nested models, the F-test p-value is 3.4e-08, so at the 0.01 level we reject the null hypothesis fit0 (mpg linear in cylinders) in favor of fit1 (cylinders as a categorical variable).
summary(lm_auto4)
summary(lm_auto5)
anova(lm_auto4,lm_auto5)
##
## Call:
## lm(formula = mpg ~ cylinders, data = auto_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.241 -3.183 -0.633 2.549 17.917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.916 0.835 51.4 <2e-16 ***
## cylinders -3.558 0.146 -24.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.91 on 390 degrees of freedom
## Multiple R-squared: 0.605, Adjusted R-squared: 0.604
## F-statistic: 597 on 1 and 390 DF, p-value: <2e-16
##
##
## Call:
## lm(formula = mpg ~ as.factor(cylinders), data = auto_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.284 -2.904 -0.963 2.344 18.027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.550 2.349 8.75 < 2e-16 ***
## as.factor(cylinders)4 8.734 2.373 3.68 0.00027 ***
## as.factor(cylinders)5 6.817 3.589 1.90 0.05825 .
## as.factor(cylinders)6 -0.577 2.405 -0.24 0.81071
## as.factor(cylinders)8 -5.587 2.395 -2.33 0.02015 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.7 on 387 degrees of freedom
## Multiple R-squared: 0.641, Adjusted R-squared: 0.638
## F-statistic: 173 on 4 and 387 DF, p-value: <2e-16
##
## Analysis of Variance Table
##
## Model 1: mpg ~ cylinders
## Model 2: mpg ~ as.factor(cylinders)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 390 9416
## 2 387 8544 3 871 13.2 3.4e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
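The F statistic in the ANOVA table above can be reproduced by hand from the printed residual sums of squares; a quick check (the numbers are copied from the table, so slightly rounded):

```r
# Partial F test from the RSS values printed in the anova table above
rss0 <- 9416   # Model 1: mpg ~ cylinders, 390 residual df
rss1 <- 8544   # Model 2: mpg ~ as.factor(cylinders), 387 residual df
df_diff <- 390 - 387

f_stat <- ((rss0 - rss1) / df_diff) / (rss1 / 387)
p_val  <- pf(f_stat, df_diff, 387, lower.tail = FALSE)

round(f_stat, 1)  # matches the 13.2 in the anova table
p_val             # close to the reported 3.4e-08 (RSS values are rounded)
```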
Final modeling question: we want to explore the effects of each feature as fully as possible. You may explore interactions, feature transformations, higher order terms, or other strategies within reason. The model(s) should be as parsimonious (simple) as possible unless the gain in accuracy is significant from your point of view.
Describe the final model. Include diagnostic plots with particular focus on the model residuals and diagnoses.
Summarize the effects found.
Predict the mpg of the following car: A red car built in the US in 1983 that is 180 inches long, has eight cylinders, displaces 350 cu. inches, weighs 4000 pounds, and has a horsepower of 260. Also give a 95% CI for your prediction.
Student Answer:
Final model: keeping mpg (numeric) as the response variable, we include cylinders as a factor, origin as a factor, and year as a numeric explanatory variable in the linear regression model.
Examining the assumptions for linear regression: first, we did not include displacement, weight, and horsepower, to avoid multicollinearity, based on the correlation matrix in the EDA part; we also removed variables with p-value > 0.05 and eventually reached the final model below. Second, the residuals-vs-fitted plot suggests the final model largely satisfies homoscedasticity. Third, the normal Q-Q plot suggests the normality assumption is largely met, with acceptable deviation in the upper quantile range.
Cylinders: Holding the other variables constant, compared with the baseline of 3 cylinders, we expect a vehicle with 4, 5, 6, or 8 cylinders to average 9.47 more, 6.20 more, 2.40 more, and 0.74 fewer miles per gallon, respectively.
Origin: Holding the other variables constant, compared with an American-made vehicle, we expect European and Japanese vehicles to average 1.66 and 3.65 more miles per gallon, respectively.
Year: Holding the other variables constant, for each one-unit increase in model year, we expect the vehicle's miles per gallon to increase by 0.74 on average.
Prediction: 21.7 miles per gallon; 95% CI for the prediction: (20.5, 23.0) miles per gallon.
library(car)
final_model <- lm(mpg ~ as.factor(cylinders)+as.factor(origin)+year, data = auto_data)
## cylinders+displacement+horsepower+weight+acceleration+year+origin
summary(final_model)
Anova(final_model)
##
## Call:
## lm(formula = mpg ~ as.factor(cylinders) + as.factor(origin) +
## year, data = auto_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.782 -2.214 -0.416 2.060 13.870
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.2808 4.7010 -8.36 1.2e-15 ***
## as.factor(cylinders)4 9.4719 1.9347 4.90 1.4e-06 ***
## as.factor(cylinders)5 6.2021 2.9595 2.10 0.0368 *
## as.factor(cylinders)6 2.3960 1.9987 1.20 0.2314
## as.factor(cylinders)8 -0.7457 2.0076 -0.37 0.7105
## as.factor(origin)2 1.6631 0.6285 2.65 0.0085 **
## as.factor(origin)3 3.6529 0.5898 6.19 1.5e-09 ***
## year 0.7441 0.0566 13.16 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.76 on 384 degrees of freedom
## Multiple R-squared: 0.772, Adjusted R-squared: 0.768
## F-statistic: 186 on 7 and 384 DF, p-value: <2e-16
##
## Anova Table (Type II tests)
##
## Response: mpg
## Sum Sq Df F value Pr(>F)
## as.factor(cylinders) 4590 4 81.1 < 2e-16 ***
## as.factor(origin) 543 2 19.2 1.2e-08 ***
## year 2449 1 173.1 < 2e-16 ***
## Residuals 5433 384
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(1,2), mar=c(5,2,4,2), mgp=c(3,0.5,0)) # plot(final_model) produces several plots
plot(final_model, 1, pch=16) # residual plot
abline(h=0, col="blue", lwd=2)
plot(final_model, 2) # qqplot
pred_car = data.frame(displacement = 350, cylinders = as.factor(8), origin = as.factor(1), horsepower = 260, weight = 4000, year = 83)
pred <- predict(final_model,newdata = pred_car,interval = "confidence",se.fit = TRUE)
pred
## $fit
## fit lwr upr
## 1 21.7 20.5 23
##
## $se.fit
## [1] 0.634
##
## $df
## [1] 384
##
## $residual.scale
## [1] 3.76
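One caveat worth noting: interval = "confidence" gives an interval for the mean mpg of all cars with these characteristics; for a single new car, interval = "prediction" yields a wider interval that also accounts for the error variance. A sketch on simulated stand-in data (the variables here are made up, not auto_data):

```r
set.seed(42)
# Simulated stand-in data; the real analysis uses auto_data
d <- data.frame(x = runif(50, 0, 10))
d$y <- 2 + 3 * d$x + rnorm(50, sd = 2)
m <- lm(y ~ x, data = d)

new_pt <- data.frame(x = 5)
conf_int <- predict(m, newdata = new_pt, interval = "confidence")
pred_int <- predict(m, newdata = new_pt, interval = "prediction")

# The prediction interval is always wider: it adds the noise variance sigma^2
(pred_int[, "upr"] - pred_int[, "lwr"]) > (conf_int[, "upr"] - conf_int[, "lwr"])
```

Whether the confidence or prediction interval is appropriate here depends on how the question is read; the answer above reports the confidence interval.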
### Other exploration
# rmse = function(a,b) { sqrt(mean((a-b)^2)) }
#
# library(leaps)
# fit.exh <- regsubsets(mpg ~as.factor(cylinders)+displacement+horsepower+weight+acceleration+year+as.factor(origin), auto_data , nvmax=25, method="exhaustive")
#
# f.e <- summary(fit.exh)
# plot(f.e$cp, xlab="Number of predictors", ylab="Cp", col="red", pch=16)
#
# opt.size <- which.min(f.e$cp)
# opt.size
# fit.exh.var <- f.e$which
# colnames(fit.exh.var)[fit.exh.var[opt.size,]]
This exercise is designed to help you understand the linear model using simulations. In this exercise, we will generate \((x_i, y_i)\) pairs so that all linear model assumptions are met.
Presume that \(\mathbf{x}\) and \(\mathbf{y}\) are linearly related with a normal error \(\boldsymbol{\varepsilon}\) , such that \(\mathbf{y} = 1 + 1.2\mathbf{x} + \boldsymbol{\varepsilon}\). The standard deviation of the error \(\varepsilon_i\) is \(\sigma = 2\).
We can create a sample input vector (\(n = 40\)) for \(\mathbf{x}\) with the following code:
# Generates a vector of size 40 with equally spaced values between 0 and 1, inclusive
x <- seq(0, 1, length = 40)
Create a corresponding output vector for \(\mathbf{y}\) according to the equation given above. Use set.seed(1). Then, create a scatterplot with \((x_i, y_i)\) pairs. Base R plotting is acceptable, but if you can, please attempt to use ggplot2 to create the plot. Make sure to have clear labels and sensible titles on your plots.
set.seed(1)
y <- 1 + 1.2 * x + rnorm(length(x), mean = 0, sd = 2.0)
data.frame("x" = x, "y" = y) %>%
ggplot(aes(x = x, y = y)) +
geom_point(aes(color = "Data")) +
labs(title = "Scatter plot of x vs. y", x = "x", y = "y")
lm() function. What are the true values of \(\boldsymbol{\beta}_0\) and \(\boldsymbol{\beta}_1\)? Do the estimates look to be good?
Student Answer: \(\hat{\boldsymbol{\beta}}_0 = 1.331, \hat{\boldsymbol{\beta}}_1 = 0.906\). The true values are \(\boldsymbol{\beta}_0 = 1.0, \boldsymbol{\beta}_1 = 1.2\). The estimates are not particularly good due to the large amount of noise.
fit <- lm(y ~ x, data.frame("x" = x, "y" = y))
fit$coefficients
## (Intercept) x
## 1.331 0.906
Student Answer: \(RSE = 1.79\). It is relatively close to \(\sigma = 2\).
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = data.frame(x = x, y = y))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.662 -0.880 0.014 1.247 2.882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.331 0.557 2.39 0.022 *
## x 0.906 0.959 0.95 0.350
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.79 on 38 degrees of freedom
## Multiple R-squared: 0.023, Adjusted R-squared: -0.00272
## F-statistic: 0.894 on 1 and 38 DF, p-value: 0.35
Student Answer: The \(95\%\) interval for \(\boldsymbol{\beta}_1\) is \([-1.034, 2.85]\). The interval captures the true \(\boldsymbol{\beta}_1 = 1.2\).
confint(fit)
## 2.5 % 97.5 %
## (Intercept) 0.203 2.46
## x -1.034 2.85
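The confint() output can be reproduced by hand from the slope estimate and standard error in the summary above (the numbers are copied from the printed summary, so slightly rounded):

```r
# Manual 95% CI for the slope, using numbers from summary(fit)
b1_hat <- 0.906             # slope estimate
se_b1  <- 0.959             # its standard error
t_star <- qt(0.975, df = 38)  # n - 2 = 38 residual df

ci <- c(b1_hat - t_star * se_b1, b1_hat + t_star * se_b1)
round(ci, 2)  # approximately (-1.04, 2.85), matching confint(fit)
```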
y_true <- 1 + 1.2 * x
data.frame("x" = x, "y" = y) %>%
ggplot(aes(x = x, y = y)) +
geom_point(aes(color = "Data")) +
geom_smooth(aes(color = "LS Estimate"), method = "lm", formula = y ~ x, size = 0.5) +
geom_line(data = data.frame("x" = x, "y" = y_true), aes(color = "True Line")) +
labs(title = "Relationship between x and y", x = "x", y = "y")
data.frame("y" = fit$fitted.values, "res" = fit$residuals) %>%
ggplot(aes(x = y, y = res)) +
geom_point() +
labs(title = "Fitted values vs. residuals", x = "fitted y", y = "residuals")
qqnorm(fit$residuals)
qqline(fit$residuals)
Student Answer: From (i), we can see that the homoscedasticity assumption is well met by the data; and from (ii), we can see that the normality assumption is well met by the data.
This part aims to help you understand the notion of sampling statistics and confidence intervals. Let’s concentrate on estimating the slope only.
Generate 100 samples of size \(n = 40\), and estimate the slope coefficient from each sample. We include some sample code below, which should guide you in setting up the simulation. Note: this code is easier to follow but suboptimal; see the appendix for a more optimal R-like way to run this simulation.
lm_sim <- function(n_sim)
{
# Initializing variables. Note b1, se, upper_ci, lower_ci are vectors
x <- seq(0, 1, length = 40)
b1 <- 0 # n_sim many LS estimates of beta_1 (=1.2). Initialize to 0 for now
se <- 0 # standard errors of the slope estimates. Initialize to 0 for now.
upper_ci <- 0 # upper bound for beta_1. Initialize to 0 for now.
lower_ci <- 0 # lower bound for beta_1. Initialize to 0 for now.
t_star <- qt(0.975, 38) # Food for thought: why 38 instead of 40? What is t_star?
# Perform the simulation
for (i in 1:n_sim){
y <- 1 + 1.2 * x + rnorm(40, sd = 2.0)
lse <- lm(y ~ x)
lse_output <- summary(lse)$coefficients
se[i] <- lse_output[2, 2] # store each run's SE (otherwise only the last run's SE is kept)
b1[i] <- lse_output[2, 1]
upper_ci[i] <- b1[i] + t_star * se[i]
lower_ci[i] <- b1[i] - t_star * se[i]
}
results <- as.data.frame(cbind(se, b1, upper_ci, lower_ci))
return(results)
}
results$b1). Does the sampling distribution agree with theory?
Student Answer: The sampling distribution agrees well with the theory: the estimator \(\hat{\boldsymbol{\beta}}_1\) is Gaussian distributed, with mean around \(1.2\) (i.e., unbiased) and variance around \(\frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = 1.14\). (We increased n_sim to 10,000 to get more accurate numbers.)
results <- lm_sim(10000)
ggplot(results) +
geom_histogram(aes(x = b1), binwidth = 0.2)
qqnorm(results$b1)
qqline(results$b1)
mean(results$b1)
## [1] 1.22
var(results$b1)
## [1] 1.14
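The 1.14 figure can be checked directly against the theoretical variance formula \(\frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\):

```r
# Theoretical variance of the slope estimator for this design
x <- seq(0, 1, length = 40)
sigma <- 2
var_b1_theory <- sigma^2 / sum((x - mean(x))^2)
round(var_b1_theory, 2)  # 1.14, in close agreement with the simulated variance
```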
Student Answer: Out of 100 simulations, the 95% CIs from about 96 of them capture the true \(\boldsymbol{\beta}_1\) (this number can vary slightly from run to run due to sampling randomness; here we fixed seed = 1 for reproducibility).
set.seed(1)
results <- lm_sim(100)
sum(results$lower_ci < 1.2 & results$upper_ci >= 1.2)
## [1] 96
results <- arrange(results, lower_ci, upper_ci)
results$index = 1:100
ggplot(results) +
geom_pointrange(aes(x = index, y = b1, ymin = lower_ci, ymax = upper_ci)) +
geom_hline(yintercept=1.2, linetype="dashed", color = "red", size=1) +
labs(x = '# simulation (sorted)', y = 'b1 and 95% CI')
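As hinted in the problem statement, the for-loop in lm_sim() can also be written in a more vectorized R style with replicate(); one possible sketch (the name lm_sim_vec is ours, not from the assignment):

```r
# Vectorized variant of lm_sim() using replicate()
lm_sim_vec <- function(n_sim, n = 40, sigma = 2) {
  x <- seq(0, 1, length = n)
  t_star <- qt(0.975, n - 2)
  # Each column of sims holds (slope estimate, its SE) from one simulated fit
  sims <- replicate(n_sim, {
    y <- 1 + 1.2 * x + rnorm(n, sd = sigma)
    summary(lm(y ~ x))$coefficients[2, 1:2]
  })
  b1 <- sims[1, ]
  se <- sims[2, ]
  data.frame(b1 = b1, se = se,
             lower_ci = b1 - t_star * se,
             upper_ci = b1 + t_star * se)
}

set.seed(1)
res <- lm_sim_vec(200)
mean(res$b1)  # should be near the true slope 1.2
```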